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ABSTRACT 

The transcription factor forkhead box P2 (F0XP2) is 
believed to be important in tlie evolution of human 
speech. A mutation in its DNA-binding domain 
causes severe speech impairment. Humans have 
acquired two coding changes relative to the 
conserved mammalian sequence. Despite intense 
interest in F0XP2, it has remained an open 
question whether the human protein's DNA- 
binding specificity and chromatin localization are 
conserved. Previous in vitro and ChlP-chip studies 
have provided conflicting consensus sequences for 
the F0XP2-binding site. Using MITOMI 2.0 
microfluidic affinity assays, we describe the 
binding site of F0XP2 and its affinity profile in 
base-specific detail for all substitutions of the 
strongest binding site. We find that human and 
chimp F0XP2 have similar binding sites that are 
distinct from previously suggested consensus 
binding sites. Additionally, through analysis of 
F0XP2 ChlP-seq data from cultured neurons, we 
find strong overrepresentation of a motif that 
matches our in vitro results and identifies a set of 
genes with F0XP2 binding sites. The F0XP2-binding 
sites tend to be conserved, yet we identified 38 
instances of evolutionarily novel sites in humans. 
Combined, these data present a comprehensive 
portrait of F0XP2's-binding properties and imply 
that although its sequence specificity has been 
conserved, some of its genomic binding sites are 
newly evolved. 



INTRODUCTION 

FOXP2 is a transcription factor of interest in the develop- 
ment and evolution of language in humans (1). Broad 
interest in F0XP2 began with the discovery of its 
hnkage to autosomal dominant transmission of develop- 
mental verbal dyspraxia, a deficit of speech articulation, in 
the large KE family pedigree (2). The trait was linked to a 
locus on chromosome 7 and eventually to a single nucleo- 
tide (residue 553) residing in the DNA-binding domain of 
F0XP2, a member of the forkhead box family of 
sequence-specific DNA-binding proteins (2-6). Several 
unrelated cases having similar phenotypes were also 
identified and typically involved truncation events of the 
3'-end of the F0XP2 open reading frame (ORF) (2,7). 
Affected individuals have normal intelhgence and 
hearing but have jerky, dysfluent and disordered speech 
(8). FOXP2, therefore, offers an entry point into under- 
standing the molecular underpinnings of the development 
of patterned syntactic speech. 

Shortly after the KE phenotype was mapped to F0XP2, 
analysis of the gene's sequence conservation revealed an 
interesting evolutionary history, adding another dimen- 
sion to its importance in human speech. The mammahan 
sequence is well conserved except for two mutations in the 
human lineage (T303N and N325S), both N-terminal to 
the Zn-finger domain (Figure 1). Conservation analysis 
revealed an enhanced non-synonymous substitution rate 
in the hominid hneage, consistent with recent selection (9). 
In support of this idea, researchers found that F0XP2 
locus sequences from a diverse panel of human individuals 
contain an excess of high-frequency derived alleles and 
rare intronic alleles indicative of a selective sweep in 
human ancestors (10,11). Animal models expressing 
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either mutant FOXP2 or lower levels of wild-type protein 
have borne out the involvement of F0XP2 in vocalization 
in mice and in zebra finches (12-14). These results suggest 
that in addition to its developmental role in speech, 
FOXP2 may have had an evolutionary role in speech 
and language. 

Although there exist several possible paths for the 
molecular evolution of FOXP2 function between ancestral 
primates and humans, here we investigate the simple 
possibihties that the selected protein mutations in the 
human lineage could have altered FOXP2's-binding 
activity, driving novel targeting and functions; and/or 
that the genomic binding sites in humans could have 
changed, causing modulation of targeting strength and 
gain and loss of F0XP2-binding targets. 

Evaluation of these possibihties would be aided by a 
thorough understanding of the FOXP2 affinity profile, 
yet there is surprisingly poor agreement over the identity 
of the F0XP2 DNA-binding motif (Tablel). This poor 
agreement may be due to either the use of different experi- 
mental techniques or rehance on previous candidate 
motifs identified through studies of related proteins (e.g. 
FOXPl and FOXP3) (15-17). The lack of a consistent 
binding site model makes it difficult to predict targets by 
sequence analysis, which in turn complicates the task of 
defining evolutionarily novel target repertoires. 

Here, we clarify FOXP2's target motif using recently 
developed microfluidic methods that measure binding 
affinity of proteins to a library of different DNA 
sequences (18,19). The resulting detailed binding site 
model reveals essentially identical affinity profiles for the 
chimp and human FOXP2 orthologs, suggesting that evo- 
lutionary differences between lineages did not involve 
distinct binding preferences. The derived F0XP2 motif 
is corroborated by an unbiased search for overrepresented 
motifs within FOXP2-bound ChlP-seq peaks. We find 
that most motif sites are conserved, and they tend to be 
near other transcription factor genes. However, we also 
find instances of evolutionarily novel F0XP2 target 
binding sites, including genes involved in synaptic 



plasticity and neural development, suggesting that 
changes in cis regulation may underhe novel functions of 
FOXP2 in human language. 



MATERIALS AND METHODS 

Cloning, mutagenesis and expression 

Full-length F0XP2 coding sequence was initially 
amplified from HeLa cDNA (primers designed to 
isoform 1 Ensembl record CCDS5760, included in 
Supplementary Information) and placed into a PCR2.1- 
topo vector. Point mutations in the derived clone were 
corrected by site-directed mutagenesis (20). The sequence 
was confirmed by Sanger sequencing and assembly with 
phred/phrap. Chimp and human mutant R553H FOXP2 
coding versions were constructed by site-directed muta- 
genesis on this wild-type human plasmid. We removed 
the first 213 codons by PCR and added flanking 
promoter, polyA, and His tag sequences necessary for 
in vitro transcription/translation and MITOMI (see 
Supplementary Information). The truncation removed 
the long polyglutamine stretch at the beginning of the 
protein for improved expression and solubility 
(Figure 1). A similar truncation was previously used for 
electromobihty shift assay (EMSA) studies (21), as 
polyglutamine stretches of >40 residues are associated 
with misfolding and aggregation (22,23). The PCR 
products were purified by Promega Wizard gel purifica- 
tion and concentrated via vacuum centrifugation to 
~I40ng/|il. TnT® T7 coupled reticulocyte lysate kit 
from Promega with the addition of 10|iM ZnCl2 was 
used to produce the protein of interest. We included 3 ^1 
of Fluorotect Green BODIPY charged lysine tRNA in 
each 75 |rl translation reaction for detection of the 
protein by fluorescence. 

MITOMI mold and device fabrication 

MITOMI devices were made as described previously 
(18,19). Briefly, molds for devices were fabricated on 
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■ T runcated region ■ ^ ^ ^ His tag 

T303N N325S R553H 

Figure 1. Schematic of FOXP2 domains and truncated construct used in MITOMI experiments showing C2H2 zinc-finger domain, leucine zipper 
domain, forkhead box DNA-binding domain and histidine repeat epitope tag (6xHis). Human hneage substitutions are at positions 303 and 325. The 
R553H mutation hnlced to verbal dyspraxia lies within the DNA-binding domain. A polyglutamine (polyQ) stretch was removed by truncation of the 
shaded region. We 6xHis-tagged the C-terminus for recruitment and retention on chip. 



Table 1. Previously reported models of the FOXP2-binding site 



Publication 



Data type 



System 



Motif 



Vernes et al. (2007) 
Vernes et al. (2008) 
Enard et al. (2009) 
Vernes et al. (2011) 



ChlP-chip 
EMSA 

Gene expression 
ChlP-chip 



SH-SY5Y cells overexpressing FOXP2 
In vitro binding to CNTAP2 sequence 
Humanized mice 
Wild-type embryonic mice 



TCTTCGT 
AATTTG 
TATTTAT 
ARKTAMYT 
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4-incli silicon wafers by mask photolithography. Masks 
were based on previously pubUshed designs (18). The 
two layers of the device were made from RTV615 
PDMS casts from the silicon molds. After partial curing, 
the two layers were aligned and baked. The two-layer 
device was then aligned and bonded to an epoxy-silane 
glass substrate (CEL associates) with a printed array of 
the DNA library. Finished devices were run as described 
previously (18,19,24). 

DNA library design, synthesis and printing 

The full 740 ohgonucleotide pseudorandom Ubrary was 
designed with software from Eisen and Mintseris (25) to 
include all possible 65 536 8-bp DNA sequences in a rela- 
tively compact sequence space. This minimal string was 
then divided into 52mer ohgonucleotides. We ordered 
these single-stranded ohgonucleotides with a 3' 14-base 
adapter sequence to enable synthesis of the complemen- 
tary strand (IDT Coralville, I A, USA, Supplementary 
Table SI). A common labeled primer complementary to 
the common adapter (Alexa647-GTCATACCGCCGGA) 
was also ordered from IDT (Coralville, lA, USA). The 
second strand was synthesized with Klenow exo~ 
enzyme. For the targeted systematic mutation hbraries, 
double-stranded ohgonucleotides were synthesized by the 
same process and then serially diluted to final working 
concentrations of 0.001-2 |iM DNA. Printing was 
carried out with silicon tips on a contact printer. 
Libraries were resuspended to a final concentration of 
3x saline-sodium citrate buffer, with 0.125% polyethyl- 
ene glycol-6000 (Fluka) and 12.5mg/ml of D-(+)-trehalose 
dihydrate (Fluka). 

MITOMI data analysis 

In general, we followed the analysis protocol described 
previously (19). An array of DNA chambers holding the 
library of double-stranded oligonucleotides was situated 
next to an array of 'button valves' that trap the interaction 
between each ohgonucleotide sample and the protein of 
interest. At the end of each experiment, devices were 
scanned to measure fluorescence intensities using an 
arrayWoRx scanner with arrayWoRx 3.0.3 software 
suite release 1. Fluorescence data for bound DNA and 
protein at the button valve and free DNA in the DNA 
chamber were extracted from the scanned devices with 
Genepix 6.1. 

To identify initial lUPAC motifs preferred by FOXP2, 
we used fREDUCE software to screen all degenerate 
Nmers in the sequence library for their Pearson correl- 
ation to associated binding scores (26). Using ratios of 
bound DNA signal to protein signal at the button valve, 
fREDUCE was run to identify preferred 6mers through 
9mers with up to three degenerate positions. The bound 
DNA/protein ratio data were normalized to the highest 
observed ratio (displayed as "rNN") in all analyses 
except for comparison of binding strengths between 
wild-type and mutant constructs. The top scoring 
lUPAC sequences by correlation and /"-value with 
respect to the whole data set were then used as input 



'seeds' for MatrixREDUCE. Given a seed sequence, 
MatrixREDUCE searches for a local optimum position- 
specific affinity matrix (PSAM) that best fits the measured 
binding data (27). MatrixREDUCE was also run on ah 
the random hbrary-binding data without any initial seed 
sequence, to remove any potential for bias introduced by 
the constraints of the lUPAC motif representation. 
MatrixREDUCE results were then scored against the 
whole data set by Pearson's correlation between the 
observed and expected occupancies. PSAM motif logos 
were made with AffinityLogo software (27). 

Binding curves were fit to a hyperbolic saturation curve 
with global non-hnear regression in Graphpad Prism 4.00. 
A dilution series of the fluorescently labeled primer was 
used as a standard curve to cahbrate the relationship 
between fluorescence intensity and free DNA concentra- 
tion on the devices. 

Chromatin IP data 

Processed ChlP-seq data from the Myers laboratory at 
Hudson Alpha were downloaded from the ENCODE 
portal of the UCSC Genome Browser (http://genome. 
ucsc.edu/cgi-bin/hgFileUi?db = hgl8&g = wgEncodeHaib 
Tfbs). The data were derived from chromatin inimunopre- 
cipitation libraries from PFSK-1 and SK-N-MC cefls 
using an antibody that recognizes the C-terminal 127 
amino acids of FOXP2 (28). Cross-hnked and sheared 
chromatin samples were sequenced and compared with 
hbraries prepared without any immunoprecipitation. We 
used the peaks called by the Myers laboratory using 
QuEST, which collapses ChlP-seq signal from both 
strands of DNA and then calculates a fold enrichment 
of the peaks over the no immunoprecipitation control 
(29). There were two biological replicates from each cefl 
fine. We used the function findOverlappingPeaks in the R 
Bioconductor ChlPpeakAnno package (http://www. 
bioconductor.org/packages/release/bioc/html/ 
ChIPpeakAnno.html) to first merge the replicate peaks 
within data from each ceU fine to form a set of 1483 
peaks. For convenience, we wiU refer to the overlapping 
peak set of 1483 as 'replicate peaks'. We merged these 
replicate peaks to form a set of 71 high-confidence peak 
sequences that gave strong signals across all samples from 
both ceU lines. The peaks' positions relative to the nearest 
genes, regardless of gene orientation, were annotated 
using the annotatePeaklnBatch function of ChlPpeak- 
Anno for genome build NCB136. We determined GO 
term enrichment using the getEnrichedGO function of 
ChlPpeakAnno with maximum /"-value of 0.05 after 
adjusting for multiple testing (30). 

Motif searching in ChlP-seq peaks 

For the set of 1483 replicate peaks and the set of 71 high- 
confidence peaks, we extracted the genomic sequences plus 
50 or 200 extra nucleotides on each end of the full peak 
sequence. These sequences were passed to MEME version 
4.3.0, which output PWMs ranked by their E-values for 
representation in the set of positive sequences (31). The 
input parameters specified a minimum motif width of 8 bp. 
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a maximum motif widtli of 50 bp, a minimum of two sites 
and an E-value tliresliold of 1 E-50. 

We also used the MITOMI-derived 7mer PSAM to 
score motifs within the 71 high-confidence peaks. For 
this analysis, we calculated the predicted occupancy 
ratio relative to the strongest sequence for 7mer 
windows across the entire ohgonucleotide sequence (27) 
and then compared the score for the highest scoring 
window with the distribution of scores for all 7mers. We 
identified candidate target sites of interest by using a score 
threshold of 0.06, which returns the top 0.1% of 7mer 
scores. 

Conservation analysis 

We hypothesized that FOXP2 motifs inside the ChlP-seq 
high-confidence peaks would exhibit elevated conserva- 
tion relative to the surrounding sequence. Using the best 
PWM from our MEME analysis, we searched replicate 
peak regions for FOXP2-binding sites using TAMO vl.O 
(32) to identify predicted binding sites. We selected a 
threshold of 90% of the maximum bit score to yield ap- 
proximately one F0XP2 motif per ChlP-seq peak. The 
same approach was taken for the 71 high-confidence 
peaks and the larger hst of 1483 rephcate peaks. From 
1483 peaks, we identified 472 that contained at least one 
instance of a FOXP2 motif within our 90% of maximum 
score threshold as scored by TAMO. We then determined 
conservation scores for windows extending 100-bp 
upstream and downstream of each predicted binding site 
using the UCSC phastCons44WayPrimate ahgnment 
score file (http://hgdownload.cse.ucsc.edu/goldenpath/ 
hgl8/phastCons44way/primates/). We used these to 
compute both an ensemble average of conservation and 
the principal components of conservation (using the R 
prcomp package) in the region centered on each predicted 
transcription factor-binding site (TFBS). 

To find novel FOXP2-binding targets among the 
human ChlP-seq peaks, we searched the merged replicate 
peaks that had strong binding sites. Of these, 38 contained 
sites with a substantial reduction in predicted FOXP2 
affinity (50% or less of maximum bit score) between 
human and chimp sequences. By analyzing the UCSC 
multiz44way ahgnment (http://hgdownload.cse.ucsc.edu/ 
goldenPath/hgl8/multiz44way/) of these 38 sites across 
human, chimp, gorilla, rhesus, marmoset, tarsier, mouse 
lemur and bushbaby, we identified 22 sites for which the 
changes are unique to the human lineage. 

RESULTS 

Human R553H mutant shows no binding activity 

Previous EMSA studies did not detect binding of the 
R553H mutant to SV40 DNA sequence (21). These 
results are consistent with two possibilities: the mutant 
could lack DNA-binding activity, or the mutant could 
have altered target site specificity. To distinguish 
between these possibilities, we used a microfluidic- 
binding assay (MITOMI 2.0) to search for binding inter- 
actions between the mutant protein and a DNA library 
containing all possible 8-bp sequences. In brief, MITOMI 



2.0 experiments measure affinities between a single 
BODIPY-labeled transcription factor and many Alexa- 
647 or Cy5-labeled DNA sequences in parallel; the 
measured DNA signal intensity normalized by the 
protein signal intensity provides a measure of the frac- 
tional protein occupancy at a given DNA concentration. 
Truncated human R553H protein gives essentially zero 
protein occupancy signal for all assayed sequences 
(Figure 2A). Our data, therefore, suggest that R553H 
has lost all DNA-binding activity and not just the ability 
to bind its normal motif 

Chimp and human proteins produce similar patterns 
of binding 

In contrast to the R553H mutant, the protein occupancy 
signal distribution for truncated chimp and human 
FOXP2 proteins contains a tail indicative of strong 
binding to a subset of DNA sequences (Figure 2A). 
Comparing the binding pattern of chimp and human 
truncated F0XP2 protein, it is clear that some probes 
are repeatedly bound, e.g. oligonucleotide #175, whereas 
most ohgonucleotides exhibit low binding (Figure 2A and 
B). The binding patterns for the two proteins are similar 
(Pearson's of 0.85, Figure 2B). 

Chimp and human orthologs bind similar motifs 

Identifying the preferred motif that correlates with 
binding to the hbrary of DNA sequences requires 
analysis because each 52-bp ohgonucleotide contains 
many potential binding sites (19). To identify these 
target sites, we first used fREDUCE, which identifies 
preferred motifs based on the correlation between 
measured binding intensity and the presence of subse- 
quences within each oligonucleotide and searched for 
preferred motifs between six and nine nucleotides in 
length (26). fREDUCE returns fists of degenerate con- 
sensus sequences ranked by their correlation to the 
observed pattern of binding to the DNA hbrary. To de- 
temiine the effects of nucleotide substitutions at each 
position within these target sites, we subsequently used 
MatrixREDUCE, which fits a local optimum PSAM to 
the observed pattern of binding (27). Supplementary 
Table S2 hsts preferred sequences obtained from 
analysis of four aggregated experiments for each 
protein (Supplementary Table S3 hsts predictions 
from individual experiments). As expected, the similar 
binding patterns observed for the chimp and human 
proteins produce similar enriched motifs (Figure 2C 
and D and Supplementary Table S2) (Supplementary 
Figure SI shows similar results obtained through 
MatrixREDUCE queries without an initial lUPAC 
seed sequence). The top motifs of different lengths are 
essentially nested versions of the same motif for the 
human and chimp protein, each containing a core 
TGTTKAC sequence. In summary, the chimp and 
human FOXP2 bind similar DNA ohgonucleotides in 
our hbrary and seem to prefer similar motifs. 
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Figure 2. Results from FOXP2 MITOMI 2.0 binding assays against a pseudorandom 8mer library. (A) Histograms of MITOMI data showing the 
ratios of DNA signal intensities to protein signal intensities for human R553H mutant, human WT and chimpanzee alleles. R553H shows no binding 
to any sequence in the library, whereas chimp and human FOXP2 produces strong binding to a subset of oligonucleotides. (B) Comparison of chimp 
and human binding ratios (rNN) for all oligonucleotides in the DNA library. Oligonucleotide #175 (used for later targeted analysis) is labeled in red. 
(C) Top scoring human MatrixREDUCE 7mer affinity logo generated using AffinityLogo (27). The height of each letter depicts the predicted 
energetic cost or benefit (AAG/RT) of a particular nucleotide at that position in the motif The centerline indicates zero energetic change. (D) Top 
scoring chimp MatrixREDUCE 7mer affinity logo. 



Systematic mutation of the binding motif provides 
base-specific affinity information 

To experimentally confirm our prediction that chimp and 
human FOXP2-binding preferences are the same at the 
single-nucleotide level, and to explore the effects of 
flanking nucleotides on affinity, we measured affinities 
for FOXP2 constructs interacting with a series of ohgo- 
nucleotides containing single-nucleotide substitutions. For 
this targeted binding curve hbrary, we chose to use the 
13 bp containing a candidate high-affinity binding site 
within a strongly bound oligonucleotide (#175) as a refer- 
ence sequence. We then designed 39 DNA sequences with 
all possible point mutations of this 13bp sequence within 
the context of the larger unchanged oligonucleotide 
(full DNA sequences in Supplementary Table S4). We 



programmed the MITOMI device with a dilution series 
of each ohgonucleotide and measured FOXP2 binding 
over the series. These experiments allowed us to calculate 
apparent KaS by non-Hnear regression of the binding 
curves for all oligonucleotides (18). 

Figure 3A plots the fold change in the KaS for each 
motif variant for both truncated chimp and human 
versions of FOXP2 derived from analysis of individual 
binding curves (example curves for all oligonucleotides 
are shown in Supplementary Figures S2 and S3). The 
bulk of the sequence specificity lies in a 7-bp core motif, 
with relatively minor contributions outside of that core. 
Although a number of point variants (e.g. Ty4TTTAC and 
TGTTTAT) are permissive for binding, with KaS only 
3-fold lower than the strongest sequence, other point 
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Figure 3. Affinity measurements for systematic mutations of the binding site and flanking sequences. (A) Fold cliange in affinity (mutated KJ 
unmutated K^) shown in log-scale. At every position, three values are shown for substitutions with each alternate base relative to the starting 
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energetic cost or benefit (AAG/RT) of adding that base at that position in the motif The centerline indicates zero energetic change. (C) PSAM 
affinity logo based on the affinities displayed in part A for the chimp allele. 



variants (e.g. TGTT/IAC) are clearly disfavored, with KaS 
> 100-fold lower than the strongest sequence. Taking these 
measurements together, we constructed an improved 
position-specific affinity matrix (PSAM) that reflects the 
experimentally observed effects of each point mutation at 
each position (Figure 3B and C) (PSAM matrices are dis- 
played in Supplementary Table S5). 

In agreement with the pseudorandom library measure- 
ments, we find that the pattern of affinity is similar across 
the two proteins (Student's /-test on the mean and 
standard deviation of the separate experiments, 
P > 0.05), confirming that there is essentially no difference 
in binding preferences between species. In addition, these 
derived motifs are in agreement with the motifs obtained 
via pseudorandom Ubrary measurements, except for 
sUghtly less G/T degeneracy at the fifth position in the 
7mer. These motifs represent the most detailed in vitro 
description of the specificities of the FOXP2-binding 
sites to date and are distinct from all previously reported 
FOXP2-binding motifs (Table 1). 

MITOMI results match an independently derived FOXP2 
motif from ChlP-seq data 

In parallel with our MITOMI efforts, we analyzed in vivo 
FOXP2 DNA-binding data from human neuronal cell 



fines from the Myers laboratory released to the public as 
part of the ENCODE consortium (28,33). To study only 
the most reproducible ChlP-seq signal enrichment peaks, 
we first identified overlapping ChlP-seq signal peaks 
within the biological replicates for each ceU Une, yielding 
1238 overlapping peaks (of 5111 peaks) for the PFSKl 
cells, and 316 overlapping peaks (of 615 peaks) for the 
SK-N-MC cells. As noted in the 'Materials and 
Methods' section, we wiU refer to this overlapping peak 
set as 'replicate peaks'. Next, we narrowed this set to 
consider only those peaks that were shared between cell 
fines, yielding 71 high-confidence peaks. 

To these 71 high-confidence ChlP-seq peaks, we added 
50 bp of the flanking genomic sequence and searched 
for enriched sequence motifs using MEME (31). The re- 
sulting top position weight matrix [E- value = 4.5E-82, 
relative entropy 13.7 bits (34)] is similar to the motif 
found using our MITOMI device (motif matrix logo 
Figure 4A, compare with Figure 3B). Extending our 
search set to all repficate peaks within each ceU fine 
generated similar results (Supplementary Figure S4). 
When including a wider sequence window of 200 bp 
around each peak, MEME returned 73 instances of a 
similar motif (E-value = 2.6E-51), but also identified 55 
instances of a long putative homopolymer G/C stretch 




Position (bp) 



B Distance to nearest gene model (bp) 




Distance to nearest gene (bp) 

Figure 4. FOXP2 target-binding motif as revealed by ChlP-seq 
analysis. (A) Motif derived from MEME analysis of 71 ChlP-seq 
peak sequences with 50 bp of flanking sequence included. Motifs are 
displayed with small sample correction error bars. (B) Histogram of the 
relative positioning of the 71 FOXP2 ChlP-seq sites relative to the start 
of the nearest neighboring gene. 



(E-value = 2.4E-56, Supplementary Figure S5). Among all 
high-confidence peaks, 47/71 (63%) contained at least one 
instance of our strongest predicted motif string, and 65/71 
(92%) peaks contained a local PSAM motif score within 
the top 0.1% of all possible 7mers (Table 2 and 
Supplementary Table S6). 

FOXP2 ChlP-seq peaks have a stereotyped position and 
flanking sequence bias 

To better understand the regulatory relevance of these can- 
didate FOXP2 sites, we investigated their location relative 
to nearby genes. The FOXP2 sites tend to cluster near the 
start of the closest gene model, with over half of the ChlP 
peaks occurring within 1 kb of a transcriptional start site 
(TSS) (Figure 4B and Supplementary Figure S6). 
Additionally, nucleotide bias calculations across the 
regions flanking the FOXP2-binding site revealed a G/C 
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bias on both sides of the FOXP2 motif instance 
(Supplementary Figure S7). This might explain the low in- 
formation content G/C biased sequences identified in our 
MEME searches over the region surrounding ChlP-seq 
peaks (Supplementary Figure S5). 

Characterization of predicted FOXP2 target genes 

To functionally characterize these sites, we mapped 
nearby genes. Fifty-eight genes are within 5kb of the 71 
high-confidence ChIP sites, and 1049 annotated genes are 
within 5kb of the replicate ChlP sites (Table 2 and 
Supplementary Table S7). Bioconductor GO term 
querying of these nearby genes returned several terms 
relating to other transcriptional regulators with strong 
/"-values, suggesting that FOXP2 tends to target other 
transcription factor genes (Tables 3 and Supplementary 
Table S6). Genes near FOXP2 sites in the hst that fit 
this description are ZBTB16, NFIA, TBLIX, ZNF395, 
CITED2, JUNE, CBX7, FOXPl, FOXKl, NR3C1 (gluco- 
corticoid receptor) and FOXPl itself A binding site near 
the FOXP2 gene is nearest to the start of the non-coding 
F0XP2 transcript NR_033766.1, annotated as a candidate 
for nonsense-mediated decay. The enrichment of tran- 
scription factors in FOXP2's putative target repertoire 
suggests that FOXP2 could act as a master regulator 
during development. 

Beyond our unbiased target search sets, we searched for 
FOXP2 ChlP-seq peaks upstream of candidate targets, 
including previously suggested binding partners that we 
did not detect in our higher stringency hst. Potential inter- 
acting forkhead box proteins F0XP4 and F0XJ2 are the 
closest annotated genes to two intergenic peaks 13-14 kb 
upstream with conserved strong matches to our motif (35). 
In addition, HDAC2, encoding a histone deacetylase that 
interacts with FOXP2 (35), has two upstream peaks. We 
find two strong FOXP2 localization peaks within intronic 
sequence of the gene that encodes CTBPl, which 
complexes with F0XP2 in yeast two-hybrid and co- 
imniunoprecipitation assays (36). However, the genes for 
NFATC2, GATAD2B, SFTPC, CCIO and IL6 (35,37) 
encoding other annotated targets or binding partners 
have no strong FOXP2 ChlP-seq peaks. Overall, these 
data suggest that FOXP2 may engage in feedback regula- 
tion of several of its annotated binding partners. 

Sequence conservation at FOXP2 localization peaks 

We expected a high degree of conservation for function- 
ally important elements within ChlP-seq peak regions. 
Using the NCBI36 UCSC phyloP scores for site-specific 
conservation of multiple ahgned sequences (38), we 
observe that 51 of our 71 high-confidence peaks overlap 
well-conserved loci. Likewise, the average of ahgned 
phastCons scores (39) reveals an increase in conservation 
centered on the predicted FOXP2-binding sites, as does 
the first principle component of the phastCons scores 
(Figure 5). This pattern holds true if we extend our 
analysis to the 472 replicate peaks that contain a motif 
scoring within 90% of the top motif PWM score 
(Supplementary Figure S8). Such elevated conservation 
further suggests the motif we identified is evolutionarily 
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Table 2. Continued 



Peak Max PSAM Top 'TGTTTAC Nearest Description ReKeq# Distance Intronic? 
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Peaks within 5 kb of either end of a gene model are shown along with PSAM motif scores. Max PSAM score refers to the maximum local alignment 
score. If the PSAM score is in the top 0.1% of score for random 7mers then it is noted in the "Top 0.1%' column. The "TGTTTAC" column notes 
whether the peak contains the consensus TGTTTAC. Nearest Gene, description and RefSeq# characterize the gene model nearest each peak. The 
nucleotide distance to TSS is sometimes >5 kb because some peaks are downstream of the gene model, and some are within large introns, as noted in 
the last column (intergenic peaks are described in Supplementary Table S6). 



Table 3. Gene ontology term analysis of consistent peaks from the ENCODE ChlP-seq data, with number of genes in each category noted 
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salient and suggests that there is a set of conserved target 
sites for FOXP2 throughout vertebrates. 

Human-specific exceptions to this generally strong con- 
servation at the binding site could provide insight into 
F0XP2's function in human-specific phenotypes. 
Therefore, we sought to identify particular F0XP2- 
binding targets with poorly conserved binding sites 
among the 71 high-confidence target loci. The sites near 
HSF2BP and PCMTD2 localization peaks contain in- 
stances of our strongest binding sequence that are 
unique to the human lineage. HSF2BP (heat shock 
factor 2-binding protein) is known to bind the develop- 
mental transcription factor HSF2, which is required for 
normal brain development (40,41). A single base change in 
humans was responsible for creating a new strong binding 
site in the seventh intron of HSF2BP. PCMTD2 (protein- 
L-isoaspartate 0-methyltransferase domain containing 2) 
is an aspartate and asparganine repair enzyme, and mice 
lacking this enzyme have increased brain size, abnormal 
arborization of pyramidal neuron dendrites and die early 
of progressive epilepsy (42,43). In the second intron of 
PCMTD2, an 18-bp deletion created a new strong 
F0XP2-binding site. These examples sparked our 
interest in the cis evolution of FOXP2-binding sites. 



To conduct a broader survey of strong FOXP2 c/.v-regu- 
latory binding sites that may be evolutionarily novel, we 
investigated ChlP-seq peaks that were consistently 
identified within either the PFSK-1 or SK-N-MC ceU 
hues and contained perfect matches to the strongest 
binding sequence. We found 38 instances of changes in 
sequence between chimp and human resulting in strong 
binding sites within replicate ChlP-seq sites. Of these, we 
discarded 16 sites in which the chimp sequence alone 
seemed to have acquired inutations relative to the mam- 
mahan consensus, leaving 22 sites of interest (Table 4). 
Roughly half of these events involve an insertion or 
deletion and the rest involve one or more point mutations. 
In aU, 63% (10/16) of the nearby genes have brain-specific 
functions (annotated in gray in Table 4) and several may 
have direct roles in neuronal function. 

For example, we find sites near the genes encoding gap 
junction protein delta 2 (GJD2), consortin (ClorfVl) and 
neuronal calcium sensor 1 (NCSl), both of which are 
involved in neuronal signal transduction. GJD2 forms a 
class of electrical synapses that modulate the firing pattern 
of neurons during development (44,45), and gap junction 
assembly requires consortin (46). At chemical synapses, 
NCSl modifies synaptic activity in response to calcium 
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Figure 5. Conservation of sequences within ChlP-seq peaks containing instances of the FOXP2 motif. (A) Example of two FOXP2 ChlP-seq peaks 
aligned with elements of strong conservation upstream of BACCHl on Chrl7: 79 366 750-79 370 250 (hgl8 / NCBI36). AHgnments are shown for two 
high-confidence peak regions with high scoring instances of our MEME motif, and the vertebrate conservation score for the underlying sequence. 
(B) The mean of the phastcons conservation score over the FOXP2 peak regions is displayed relative to the position of the strong FOXP2-binding 
motif, with the genomic background average conservation score in red. The first principal component of the phastcons conservation is plotted in blue 
on the same scale, noted on the right-hand axis. 



current, with broader roles in plasticity and spatial 
memory tasks in mice (47,48). These candidate novel 
target genes could have been important to the evolution 
of the FOXP2 regulon in humans. 



DISCUSSION 

An accurate and precise binding site model provides a 
useful tool to study FOXP2's evolution and molecular in- 
volvement in the development of language. Despite intense 
interest, the true binding preferences of FOXP2 have 
remained a mystery, with different experimental techniques 
yielding different candidate consensus sites (Table 1). To 
clarify FOXP2's binding site preference, we produced 
detailed models of the binding site from independent 
microfluidic affinity cell free assays (Figure 3) and 
neuronal cell-based ChlP-seq data sets (Figure 4). We 
find that the human and chimp FOXP2 in vitro binding 
profiles are virtually identical, featuring the same 
degeneracies at the same positions. The in vitro MITOMI 



data provide additional information about the penalties of 
a given substitution, whereas the ChlP-seq data provide 
clues to genomic targets in a more physiological setting. 
Using our in vitro derived motif to identify candidate 
FOXP2-bindiiig sites, we find 18 ChlP-seq peaks with 
binding sites that would have been missed by a strict 'TG 
TTTAC consensus sequence search. We identify several 
human-specific FOXP2-binding sites that may contribute 
to the evolutionarily novel role of F0XP2 in language. 

In addition to the strong siinilarity between our inde- 
pendently derived motifs, other observations suggest that 
we have identified a relevant FOXP2-binding motif First, 
our motif is consistent with the accepted RYMAAYA 
non-FOXP Forkhead box family theme (49,50). Second, 
conservation scores within ChlP-seq peak regions tend to 
peak at the exact location of our predicted binding sites. 
Together, these independent lines of evidence suggest that 
we have resolved the functional FOXP2-binding motif, 
modeled both in terms of positional affinity effects and 
positional frequencies among bound sites. 
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Additional analysis demonstrates that the motif derived 
here improves consistency with previous F0XP2 ChlP- 
chip data (17). Our core motif, modeled as a 5mer 
TGTKK for the sake of comparison, is overrepresented 
in the most significant ChIP probes, whereas the previ- 
ously suggested ATTTG motif is present at the level of 
expectation (Supplementary Figure S9). Nucleotide biases 
can complicate motif search algorithms and may explain 
some of the previous controversy surrounding the binding 
site. There is a G/C bias in the most highly enriched ChlP- 
chip probes, perhaps because of a tendency for FOXP2 to 
bind sites near TSSs within CpG islands (51). 

Encouragingly, orthologs of the genes we identify as 
hkely binding targets of FOXP2 also have altered 
patterns of expression and Foxp2 ChlP-chip signal as 
shown in previous experiments. Vernes et al. (52) profiled 
expression in wild-type and Foxp2 32 IX mutant mice and 
returned a list of 19 genes that had both ChlP-chip signal 
and significant expression changes. We found that 17 of 
these 19 genes have a peak within 5kb in at least one 
sample in the human ENCODE ChlP-seq data; 
ALCAM, CCK, CSDEl, EBF2, GNAL, GNAS, 
MAPK8IP3, MASTl, NEGRI, NRNl, PLAGl, PSME4, 
SFXN4, TCF12, TGFBI, CITED2 and COL24A1. An 
especially interesting target candidate from this hst is 
CITED2 (Cbp/p300-interacting transactivator, with Glu/ 
Asp-rich carboxy-terminal domain, 2), which modulates 
recruitment of the pBOO histone acetyltransferase to pro- 
moters resulting in remodehng of the chromatin locus (53) 
and modification of FOXO proteins (54). C1TED2 and 
these other genes seem to be reproducible FOXP2 
targets as observed by independent researchers, with 
both activating and repressing outcomes (52). 

From the ENCODE ChlP-seq data, we produced hsts 
of consistent localization targets in neuronal cell lines and 
found that F0XP2 binds near genes encoding DNA- 
binding proteins, such as glucocorticoid receptor and 
other forkhead box proteins. The set of putative targets 
includes an alternative transcript of F0XP2 itself and the 
gene encoding its annotated binding partner FOXPl. The 
ChlP-seq association with FOXPl is interesting because 
disruptions of these genes produce phenotypes with 
similar characteristics (55) and can cooperatively 
regulate reporter constructs (36,56). We speculate that 
autoregulation of the FOXP2 circuit may prove important 
to F0XP2's developmental function. In support of this 
hypothesis, FOXP2 is thought to be part of a co-expressed 
network of genes having a higher degree of connectivity in 
humans than in chimp and macaque (57). These themes 
are consistent with the idea of F0XP2 as a regulator of 
transcriptional regulators. 

Regarding the question of FOXP2's functional evolu- 
tion, our data suggest that some of the genomic binding 
sites have evolved, while the DNA-binding specificity of 
FOXP2 has been conserved. The FOXP2 PSAM motif 
and binding sites show a high degree of conservation in 
both biochemical affinity measurements and sequence 
alignment at ChlP-seq peaks. This pattern of broad 
target site conservation suggests that there is a core set 
of FOXP2 targets in vertebrates, with a limited but inter- 
esting set of changed targets in humans. We have observed 



22 potential examples of such cis evolution. These may 
represent newly acquired regulatory targets for human 
FOXP2 (Table 4, e.g. NCSl a synaptic calcium sensor 
involved in synaptic plasticity). Importantly, the FOXP2 
bound genes hsted here should not be considered an au- 
thoritative list. Rather, they were primarily used to 
analyze the binding site, and candidate binding sites 
were investigated for potential instances of evolution in 
humans. However, with a comprehensive binding site 
model, we can now improve our lists of direct FOXP2 
targets and better understand how its regulon may have 
changed over evolution. Future work of interest may 
include investigation of the differential protein-protein 
interactions of the chimp and human F0XP2, and gener- 
ation of chimp FOXP2 ChlP-seq data for comparison 
with the existing mouse and human data sets. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Tables 1-8 and Supplementary Figures 
1-9. 
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